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Abstract 

In many global optimization problems motivated by engineering applications, the number 
of function evaluations is severely limited by time or cost. To ensure that each evaluation 
contributes to the localization of good candidates for the role of global minimizer, a se- 
quential choice of evaluation points is usually carried out. In particular, when Kriging is 
used to interpolate past evaluations, the uncertainty associated with the lack of information 
on the function can be expressed and used to compute a number of criteria accounting for 
the interest of an additional evaluation at any given point. This paper introduces minimizer 
entropy as a new Kriging-based criterion for the sequential choice of points at which the 
function should be evaluated. Based on stepwise uncertainty reduction, it accounts for the 
informational gain on the minimizer expected from a new evaluation. The criterion is ap- 
proximated using conditional simulations of the Gaussian process model behind Kriging, 
and then inserted into an algorithm similar in spirit to the Efficient Global Optimization 
(EGO) algorithm. An empirical comparison is canied out between our criterion and ex- 
pected improvement, one of the reference criteria in the literature. Experimental results 
indicate major evaluation savings over EGO. Finally, the method, which we call lAGO 
(for Informational Approach to Global Optimization) is extended to robust optimization 
problems, where both the factors to be tuned and the function evaluations are corrupted by 
noise. 

Key words: Gaussian process, global optimization, Kriging, robust optimization, stepwise 
uncertainty reduction 



1 Introduction 



This paper is devoted to global optimization in a context of expensive function eval- 
uation. The objective is to find global minimizers in X (the factor space, a bounded 



subset of W^) of an unknown function / : X ^ R, using a very limited number of 
function evaluations. Note that the global minimizer may not be unique (any global 
minimizer will be denoted as x*). Such a problem is frequently encountered in 
the industrial world. For instance, in the automotive industry, optimal crash-related 
parameters are obtained using costly real tests and time-consuming computer sim- 
ulations (a single simulation of crash-related deformations may take up to 24 hours 
on dedicated servers). It then becomes essential to favor optimization methods that 
use the dramatically scarce information as efficiently as possible. 

To make up for the lack of knowledge on the function, surrogate (also called meta 
or approximate) models are used to obtain cheap approximations (Jones [2001]). 
They turn out to be convenient tools for visualizing the function behavior or sug- 
gesting the location of an additional point at which / should be evaluated in the 
search for x*. Surrogate models based on Gaussian processes have received par- 
ticular attention. Known in Geostatistics under the name of Kriging since the early 
1960s (Matheron [1963]), Gaussian process models provide a probabilistic frame- 
work to account for the uncertainty stemming from the lack of information on the 
system. When dealing with an optimization problem, this framework allows the set 
of function evaluations to be chosen efficiently (Jones [2001], Jones et al. [1998], 
Huang et al. [2006]). 

In this context, several strategies have been proposed, with significant advantages 
over traditional optimization methods when confronted to expensive-to-evaluate 
functions. Most of them implicitly seek a likely value for x*, and then assume it 
to be a suitable location for a new evaluation of /. Yet, given existing evaluation 
results, the most likely location of a global minimizer is not necessarily a good eval- 
uation point to improve our knowledge on x* . As we shall show, by making full use 
of Kriging, it is instead possible to explicitly estimate the probability distribution of 
the optimum location, which allows an information-based search strategy. 

Based on these observations, the present paper introduces minimizer entropy as a 
criterion for the choice of new evaluation points. This criterion, directly inspired 
from stepwise uncertainty reduction (Geman and Jedynak [1995]), is then inserted 
in an algorithm similar to the Efficient Global Optimization (EGO) algorithm (Jones 
et al. [1998]). We call the resulting algorithm lAGO, for Informational Approach 
to Global Optimization. 

Section 2 recalls the principle of Kriging-based optimization, along with some gen- 
eral ideas on Gaussian process modeling that are used in Section 3 to build an es- 
timate of the distribution of the global minimizers. Section 4 details the stepwise 
uncertainty reduction approach applied to global optimization, while Section 5 de- 
scribes the corresponding algorithm and its extensions to noisy problems. Section 6 
illustrates the behavior of the new algorithm on some simple benchmark problems, 
along with its performances compared with those of the classical EGO algorithm, 
chosen for its good compromise between local and global search (Sasena et al. 
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[2002]). Finally, after a conclusion section and to make this paper self-contained, 
Section 8 presents, as an appendix, some more results on Gaussian process model- 
ing and Kriging. 



2 Kriging-based global optimization 

When dealing with expensive-to-evaluate functions, optimization methods based 
on probabilistic surrogate models (and Kriging in particular) have significant ad- 
vantages over traditional optimization techniques, as they require fewer function 
evaluations. Kriging can indeed provide a cheap and accurate approximation of the 
function, but also an estimate of the potential error in this approximation. Numerous 
illustrations of this superiority can be found in the literature (see, for instance. Cox 
and John [1997]) and many variations have been explored (for extensive surveys, 
see Jones [2001] and Sasena et al. [2002]). As explained in this section, these meth- 
ods deal with the cost of evaluation using an adaptive sampling strategy, replacing 
the optimization of the expensive-to-evaluate function / by a series of optimiza- 
tions of a cheap criterion. 



2.1 Gaussian process modeling and Kriging 

This section briefly recalls the principle of Gaussian process (GP) modeling, and 
lays down the necessary notation. A more detailed presentation is available in the 
appendix (Section 8). 

When modeling with Gaussian processes, the function / is assumed to be a sample 
path of a second-order Gaussian random process F. If we denote (Vt, A, V) the 
underlying probability space, this amounts to assuming that 3 G ^2, such that 
F{ijj, ■) = /(■). Whenever possible, we shall omit the dependence of F in to 
simplify notation. 

In particular, given a set of n evaluation points S = {xi, . . . , Xn} (the design), 
\/Xi e S the evaluation result f{xi) is viewed as a sample path of the random vari- 
able F{xi). Kriging computes an unbiased linear predictor of F(x) in the vector 
space E[§ = span{F(a;i), . . . , F{xn)}, which can be written as 

F{x) = X{xyFs, (1) 

with Fs = [F{xi), . . . , F{xn)y , and X{x) the vector of Kriging coefficients for 
the prediction at x. 

Given the covariance of F, the Kriging coefficients can be computed along with the 
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variance of the prediction error 

a^{x) = vaT{F{x) - F{x)). (2) 

The covariance of F is chosen within a class of parametrized covariances (for in- 
stance, the Matem class), and its parameters are either estimated from the data or 
chosen a priori (see Section 8.3.2 for details on the choice of a covariance). 

Once / has been evaluated at all evaluation points in §, the predicted value of / at 
X is given by 

fix) = XixYfs , (3) 
with /§ = [f{xi), . . . , f{xn)]^ (fs is viewed as a sample value of Fg). The same 
results could be derived in a Bayesian framework, where F{x) is Gaussian con- 
ditionally to the evaluations carried out (F^ = /§), with mean f{x) and variance 
a'ix). 

Note that 

yxi e §, F{xi) = F{x,), (4) 

and that the prediction at Xi E S is f{xi). When / is assumed to be evaluated ex- 
actly, Kriging is thus an interpolation, with the considerable advantage over other 
interpolation methods that it also provides an explicit characterization of the pre- 
diction error (zero-mean Gaussian with variance cr'^{x)). 

2.2 Adaptive sampling strategies 

The general principle of optimization using Kriging is iteratively to evaluate / at 
a point that optimizes a criterion based on the model obtained using previous eval- 
uation results. The simplest approach would be to choose a minimizer of the pre- 
diction / as a new evaluation point. However, by doing so, too much confidence 
would be put in the current prediction and search is likely to stall on a local opti- 
mum (as illustrated by Figure 1). To compromise between local and global search, 
more emphasis has to be put on the prediction error that can indicate locations 
where additional evaluations are needed to improve confidence in the model. This 
approach has led to a number of criteria, based on both prediction and prediction 
error, designed to select additional evaluation points. 

A standard example of such a criterion is expected improvement (EI). As the name 
suggests, it involves computing how much improvement in the optimum is ex- 
pected, if / is evaluated at a given additional point. Let F be the Gaussian pro- 
cess model, as before, and /min be the current best function value obtained. The 
improvement expected from an additional evaluation of / at a; given /§, the results 
of past evaluations, can then be expressed as 

EI(a;) = E [max (/^in - F {x) , 0) \Fg = h] . 
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Fig. 1. Naive approach to optimization using Kriging: (top) prediction / (bold Une) of the 
true function / (dotted line, supposedly unknown) obtained from an initial design mate- 
rialized by squares; (bottom) prediction after seven iterations of a direct minimization of 
/■ 

Since F(x) is conditionally Gaussian with mean f(x) and variance a^(a;), a con- 
venient expression appears (Schonlau [1997]): 



El{x) 



<T[X) M$(m) + — (m) 

du 



(5) 



with 
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and $ the normal cumulative distribution function. The new evaluation point is then 
chosen as a global maximizer of EI(£c). An example is given on Figure 2, where the 
problem that deceived the naive method of Figure 1 is directly solved with the EI 
criterion. First introduced in (Jones et al. [1998]), this method has been used for 
computer experiments in (Sasena et al. [2002]), while modified criteria have been 
used in (Huang [2005]) and (Williams et al. [2000]) to deal with noisy functions. 

In (Jones [2001]) and (Watson and Barnes [1995]), a fair number of alternative 
criteria are presented and compared. Although quite different in their formulation, 
they generally aim to answer the same question: What is the most likely position of 
x*l Another, and probably more relevant, question is: Where should the evaluation 
be carried out optimally to improve our knowledge on the global minimizers? 



In what follows, a criterion that addresses this question will be presented, along 



with its performances. The reference for comparison will be EI, which is, according 
to Sasena et al. [2002], a reasonable compromise between local and global search, 
and has been successfully used in many applications. 



3 Estimating the density of a;* 

Once a Kriging surrogate model / has been obtained, any global minimizer of / 
is a natural approximation of x*. However, it might be excessively daring to trust 
this approximation as it does not take in account the uncertainty of the prediction. 
A more cautious approach to estimating x* is to use the probabilistic framework 
associated with F. Of course, x* is not necessarily unique, and we shall focus on 
describing the set of all global minimizers of / as efficiently as possible. 

3. 1 Probabilistic modeling of the global minimizers of f 

According to the GP model, a global minimizer x* of / corresponds to a global 
minimizer of this particular sample path of F. More formally, consider the random 
set Aix of the global minimizers of F over X, i.e. the set of all global minimizers 
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Fig. 2. EI approach to optimization using Kiiging: (top) f (bold line), 95% confidence in- 
tervals computed using a (dashed line) and true function / (dotted line); (bottom) expected 
improvement. 
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for each sample path, which can be written as 



A^J(cu) = {x* e X\F{uj,x*) 



mm F{u!,u)}, Wu G fl, 



with f2 the sample set. To ensure that Ai^ is not empty, we assume that F has con- 
tinuous sample paths with probability one. This continuity can be ensured through 
a proper choice of covariances (see, e.g., (Abrahamsen [1997])). 

Let X* be a random vector uniformly distributed on Ai^- The probability density 
function of this random vector conditional to past evaluation results, that we shall 
thereafter call conditional density of the global minimizers, is of great interest, as 
it allows one not only to estimate the global minimizers of / (for example, through 
the maximization of their probability density function), but also to characterize the 
uncertainty associated with this estimation. In fact, given the results fg of previous 
evaluations, the probability density function px'ifyix) of X* conditionally to f§ 
contains all of what has been assumed and learned about the system. However, 
no tractable analytical expression for such a quantity is available (Adler [2000], 
Sjo [2000]). To overcome this difficulty, the approach taken here is to consider a 
discrete version of the conditional distribution, and to approximate it using Monte 
Carlo simulations. 

Let G = {a^i,...,iCjv}bea finite subset of X, tVI^ be the random set of global 
minimizers of F over G, and X^ be a random vector uniformly distributed on Ai^. 
The conditional probability mass function of Xq given fg is then Va; G G 



It can be approximated using conditional simulations, i.e., simulations of F that 
satisfy Fg = fg. Assuming that non-conditional simulations are available, several 
methods exist to make them conditional (Chiles and Delfiner [1999]). Conditioning 
by Kriging seems the most promising of them in the present context and will be 
presented in the next section. 



3.2 Conditioning by Kriging 

This method, due to G. Matheron, uses the unbiasedness of the Kriging prediction 
to transform non-conditional simulations into simulations interpolating the results 
fg of the evaluations. Let Z be a zero-mean Gaussian process with covariance k (the 
same as that of F) and Z be its Kriging predictor based on the random variables 
Z{xi), Xi G S, and consider the random process 



Vxi\k{x) = P(X* =x\Fg = fg) . 



T{x) = fix) + Z{x) - Z{x) , 



(6) 
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where / is the mean of the Kriging predictor based on the design points in S. 
Since this Kriging predictor is an interpolator, at evaluation points in §, we have 
f{xi) = f{xi). Equation (4) implies that Z{xi) = Z(xi), which leads to T{xi) = 
f{xi), \fxi G S. In other words, T is such that all its sample paths interpolate the 
known values of /. It is then easy to check that T has the same finite-dimension 
distributions as F conditionally to past evaluation results (Delfiner [1977]). Note 
that the same vector A(x) of Kriging coefficients is used to interpolate the data and 
the simulations at design points. Using (3), one can rewrite (6) as 

T{x) = Z{x) + X{xY [/s - Zs] , (7) 

withZ§ = 

In summary, to simulate F over G conditionally to past evaluation results /§, 
we can simulate a zero-mean Gaussian process Z over G, and use the following 
method: 

• Compute, for every point in G, the vector of Kriging coefficients based on the 
design points in S, 

• compute the Kriging prediction f{x) based on past evaluation results /g for 
every x in G, 

• sample over G non-conditional sample paths of Z (provided that a Gaussian 
sampler is available, setting the proper covariance can be achieved using, for 
example, the Cholesky decomposition), 

• apply (7) at every point in G. 

With this sampling method (see Figure 3 for an illustration), it becomes straight- 
forward to estimate Px*\fs- Let x* be a global minimizer of the z-th conditional 
simulation (i = 1, . . . , r) over G (if it is not unique, choose one randomly). Then, 
for any x in G the empirical probability mass distribution 

1 

Pxai/s(^) = -E^-rN' (8) 

with 5 the Kronecker symbol, tends almost surely towards px*\fs{x) as r tends to- 
wards infinity. Moreover, Px*\fs tends in distribution towards px*\fs G becomes 
dense in X. 

Figure 4 presents the approximation achieved by Px*,\fs for an example where lo- 
cating a global minimizer is not easy. Knowing the conditional distribution of 
gives valuable information on the areas of X where a global minimizer might be 
located, and that ought to be investigated. This idea will be detailed in the next 
section. 
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Fig. 3. Conditioning a simulation: (top) unknown real curve / (doted line), sample points 
(squares) and associated Kiiging prediction / (bold line); (middle) non-conditional simula- 
tion z, sample points and associated Kiiging prediction z (bold line); (bottom) the simula- 
tion of the Kriging error z — zis picked up from the non-conditional simulation and added 
to the Kriging prediction to get the conditional simulation (thin line). 

4 The stepwise uncertainty reduction strategy 

The knowledge about the global minimizers of / is summarized by the approxima- 
tion px*,\fs of the conditional probability mass function of X^. In order to evaluate 
the interest of a new evaluation of / at a given point, a measure of the expected in- 
formation gain is required. An efficient measure is conditional entropy, as used in 
sequential testing (Geman and Jedynak [1995]) in the Stepwise Uncertainty Reduc- 
tion (SUR) strategy. This section extends the SUR strategy to global optimization. 



4. 1 Conditional entropy 

The entropy of a discrete random variable U (expressed in bits) is defined as: 

u 

H{U) measures the spread of the distribution of U . It decreases as this distribution 
gets more peaked. In particular : 
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Fig. 4. Estimation of the conditional point mass distribution of X^: {top) Kriging interpo- 
lation, 95% confidence intervals and sample points; {bottom) estimated point mass distri- 
bution of using 10000 conditional simulations of F and a regular grid for G. 

. Pxi\k{x) = 1/iV Va; G G ^ H{Xl) = log2(iV), 
if x 7^ 



^ H(X^) = 



1 if 03 = ajo 



Similarly, for any event B, the entropy of U relative to the probability measure 

F{.\B) is 

H{U\B) = {U = u\B) \og^ P{U = u\B). 

u 

The conditional entropy of U given another discrete random variable V is 

H{U\V) = = '")H{U\V = v), 

V 

and the conditional entropy of U given B and V is 



H{U\B,V) = Y^^i^ = ^\^)H{U\B,V 



(9) 



More details on conditional entropy can be found in Cover and Thomas [1991]. 
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4.2 Conditional minimizer entropy 



Consider Fq{x) a discrete version of F{x), defined as Fq{x) = Q{F{x)) with 
Q a quantization operator. Q is characterized by a finite set of M real numbers 
{yi, . . . ^hm}, and defined as 



Vm e M Q{u) 



yi if n < yi 

yi if yi_i <u<yi^i e |2, M\ 

yu if yM < u 



For optimization problems, the SUR strategy for the selection of the next value of 
a; G X at which / will be evaluated will be based on H(X^\Fs = /§, Fq(x)), 
the conditional entropy of given Fq(x) and the evaluation results {Fg = /§} 
(we shall refer to it later on as conditional entropy of the minimizer, or simply 
conditional entropy). 



Using (9) we can write 



M 



HiX*^\Fn = fn^Fgix)) = ^P(FQ(a;) = y,\Fs = fn)HiX*^\Fn = h, Fgix) = y, 

i=l 

(10) 

with 



H{X^\Fs = h,FQ{x) = yi) = - Px*|/s,y,(u)log2 Px*|/s,y,(u) , 

and 

= P(^* = u\Fs = f§,FQ{x) = yi). 



H{X^\Fs = fs,FQ{x)) is a measure of the anticipated uncertainty remaining 
in X^ given the candidate evaluation point x and the result /§ of the previous 
evaluations. Anticipation is introduced in (10) by considering the entropy of X^ 
resulting from every possible sample value of Fq{x). At each stage of the iterative 
optimization, the SUR strategy retains for the next evaluation a point that minimizes 
the expected conditional minimizer entropy after the evaluation, i.e., a point that 
maximizes the expected gain in information about X^. 

The conditional entropy of the minimizer thus takes in account the conditional sta- 
tistical properties of F and particularly the covariance of the model. There lies the 
interest of the SUR strategy applied to global optimization. It makes use of what 
has been previously assumed and learned about / to pick up the most informative 
evaluation point. By contrast, the EI criterion (as most standard criteria) depends 
only on the conditional mean and variance of F at the design point considered. 
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5 Implementing the SUR strategy 



5. 1 lAGO algorithm 

Our algorithm is similar in spirit to a particular strategy for Kriging-based opti- 
mization known as Efficient Global Optimization (EGO) (Jones et al. [1998]). EGO 
starts with a small initial design, estimates the parameters of the covariance of F 
and computes the Kriging model. Based on this model, an additional point is se- 
lected in the design space to be the location of the next evaluation of / using the 
EI criterion. The parameters of the covariance are then re-estimated, the model re- 
computed, and the process of choosing new points continues until the improvement 
expected from sampling additional points has become sufficiently small. The lAGO 
algorithm uses the same idea of iterative incorporation of the obtained information 
to the prior on the function, but with a different criterion. 

The computation of the conditional entropy using (10) requires the choice of a 
quantization operator Q. We use the fact that F(x) is conditionally Gaussian with 
known mean and variance (obtained by Kriging), to select a set of possible values 
{yi, . . . ,2/A/}, such that 

P{Fq{x) = y,\Fn = /g) = 1: Vz G [1 : M] . 

By doing so, we choose a different quantization operator Q for each value of x 
to improve the precision with which the empirical mean of the entropy reduction 
over possible evaluation results is computed. This is simply an improved version of 
Monte Carlo integration. Here we used a set of ten possible values (M = 10). 

For each of these possible values (or hypotheses F(x) = yi), pf^^y^ is computed 
using conditional simulations. The conditional entropy is then obtained using (10). 
These operations are carried out on a discrete set of candidate evaluation points 
(see Section 5.2 for some details on the choice of this set), and a new evaluation of 
/ is finally performed at a point that minimizes the conditional entropy. Next, as 
in the EGO algorithm, the covariance parameters are re-estimated and the model 
re-computed. The procedure for the choice of an additional evaluation point is de- 
scribed in Table 1 . 

When the number of additional function evaluations is not specified beforehand, 
we propose to use as a stopping criterion the conditional probability that the global 
minimum of the GP model be no further apart of the current minimum /^j^ of the 
Kriging interpolation than a given tolerance threshold 6. The algorithm then stops 
when 

P(i^*</min + 5|i"s = /s)<^Stop, 
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Algorithm 

Input: Set S = {xi, . . . , Xn} of evaluation points and corresponding values /§ of the 
function / 

Output: Additional evaluation point a;ncw 

1 . Choose G, a discrete representation of X 

2. Set covariance parameters either a priori or by maximum-likelihood estimation based 
on /§ 

3. Compute r non-conditional simulations over G 

4. Compute f{x) and ct{x) over G by Kriging from /§ 

5. while the set of candidate points has not been entirely explored 

6. do Take an untried point x^ in the set of candidate points 

7. Compute the parameters {yi, . . . , i/m} of the quantization operator Q 

8. Compute A = [A(a;i), . . . , X{xn)] the matrix of Kiiging coefficients at every 
point in G based on evaluation points in S and Xc 

9. for i ^ 1 to M 

10. do Construct conditional simulations using (7) and assuming that f{xc) = 

Vi 

11. Find a global minimizer x^, of the A;th conditional simulation over G 
{k = l,...,r) 

12. Estimate Px*\fs,y, over G using (8) 

13. Compute //(X* |F§ = /§, Fq{xc) = yi) 

14. Compute the conditional entropy given an evaluation at x^ using (10) 

15. Output a^ncw that minimizes the conditional entropy over the set of candidate points 

Table 1 

Selection of a new evaluation point for /. 

with F* = miiUxeG F(x), and Pstop ^ [0, 1] a critical value to be chosen by 
the user. Proposed by Schonlau (Schonlau [1997]), this stopping criterion is well 
suited here, since evaluating the repartition function of f{x*) does not require any 
additional computation. 

5.2 Computational complexity 

With the previous notation, n the number of evaluation points, r the number of 
conditional simulations, N the number of points in G and M number of possible 
results for an evaluation, the computational complexity for the approximation of 
conditional entropy (Steps 7 to 14 in Table 1) is as follows: 

• computing Kriging coefficients at every point in G (Step 8): 0(nN) (once the co- 
variance matrix in (17) has been factorized, Kriging at an untried point is simply 
inO(n)), 

• constructing conditional simulations (Step 10): 0{nrN) (M is actually not in- 
volved since the main part of conditioning can be carried out outside the loop on 
the possible evaluation values), 

• locating the global minimizers for each simulation by exhaustive search (Step 11): 
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0{rNM), 



Since all other operations are in O(A^) at most, evaluating conditional entropy at 
any given point requires O(A^) operations. 

To complete the description of an implementable algorithm, we must specify a 
choice for G and a policy for the minimization of conditional entropy. What follows 
is just an example of a possible strategy, and many variants could be considered. 

The simplest choice for G is a uniform grid on X. However, as the number of 
evaluations of / increases, the spread of Px*\fs diminishes along with the precision 
for the computation of the entropy. To keep a satisfactory precision over time, G can 
be a random sample of points in X, re-sampled after every evaluation of / with the 
density Px*|/s- Re-sampling makes it possible to use a set G with a smaller cardinal 
and to escape, at least partly, the curse of dimensionality (to resample using Px*jfs, 
any non-parametric density estimator could be used along with a sampling method 
such as Metropolis Hastings, see, e.g., (Chib and Greenberg [1995])). 

Ideally, to choose an additional evaluation point for / using lAGO, conditional 
entropy should be minimized over X. However, this of course is in itself a global 
optimization problem, with many local optima. It would be possible to design an ad- 
hoc optimization method (as in (Jones [2001])), but this perspective is not explored 
here. Instead, we evaluate the criterion extensively over a chosen set of candidate 
points. Note that only the surrogate model is involved at this stage, which makes 
the approach practical. The idea is, exactly as for the choice of G, to use a space- 
filling sample covering X and resampled after each new evaluation. The current 
implementation of lAGO simply uses a Latin Hyper Cube (LHC) sample, however, 
it would be easy to adapt this sample iteratively using the conditional distribution 
of the minimizers Px^\fs ^ prior. For instance, areas of the design space where 
the density is sufficiently small could be ignored. After a few evaluations, a large 
portion of the design space usually satisfies this property, and the computations 
saved could be used to improve knowledge on the criterion by sampling where 
Px*\fs is high (using the same approach as for the choice of G). 

As dimension increases, trying to cover the factor space while keeping the same 
accuracy leads to an exponential increase in complexity. However, in a context of 
expensive function evaluation, the objective is less to specify exactly all global 
minimizers (which could be too demanding in function evaluations anyway), than 
to use available information efficiently to reduce the likely areas for the location 
of these minimizers. This is exactly the driving concept behind lAGO. In practice, 
within a set of one thousand candidate points, picking an additional evaluation point 
requires about five minutes with a standard personal computer (and this figure is 
relatively independent of the dimension of the factor space). Moreover, the result 
obtained can be trusted to be a consistent choice within this set of candidate points, 
in regard of what has been assumed and learned about /. 
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5.3 Taking noise in account 



Practical optimization problems often involve noise. This section discusses possible 
adaptations of the optimization algorithm that make it possible to deal with noisy 
situations, namely noise on the evaluation of / and noise on the factors. 



5.3.1 Noise on the evaluation of f 

When the evaluations of / are corrupted by noise, the algorithm must take this fact 
into account. A useful tool to deal with such situations is non-interpolative Kriging 
(see Section 8.2). 

If the evaluation at Xi E § is assumed to be corrupted by an additive Gaussian 
noise £i with known mean and variance, the Kriging prediction should no longer 
be interpolative (see Section 8.2). The optimization algorithm remains nearly un- 
changed, except for the conditional simulations. We now wish to build sample paths 
of F conditionally to evaluation results, i.e. realizations of the random variables 
f{xi) + Ei for Xi E S. Since the variance of the prediction error is no longer zero 
at evaluation points (in other words, there is some uncertainty left on the values of 
/ at evaluation points), we first have to sample, at each evaluation points, from the 
distribution of F conditionally to noisy evaluation results. An interpolative simula- 
tion, based on these sample values, is then built using conditioning by Kriging. An 
example of such a simulation is proposed on Figure 5. 



5.3.2 Noise on the factors 

In many industrial design problems, the variability of the values of the factors in 
mass production has a significant impact on the performance that can be achieved. 
In such a case, one might want to design a system that optimizes some performance 
measure while ensuring that the performance uncertainty (stemming from noise on 
the factors) remains under control. These so-called robust optimization problems 
can generally be written as 

argmin J(x) , (11) 

a; G D 

with J{x) a cost function reflecting some statistical property of the corrupted per- 
formance measure f{x + s), where e is a random vector accounting for noise on 
the factors. Classical cost functions are: 

• mean: J{x) = Ee[/(a3 + e)], 

• standard deviation: J{x) = a^lf^x + e)] = var(/(a; + e))), 

• linear combination of mean and standard deviation: J{x) = Ee[/(a; + s)] + 
kae[f{x + e)]. 
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Fig. 5. Example of prediction by Kriging (bold line) of noisy measurements represented 
by squai^es. Dashed lines represent 95% confidence regions for the prediction and the thin 
solid line is an example of conditional simulation obtained using the method presented in 
Section 5.3.1 (crosses represent the simulated measurements.) 



a-quantile: J{x) = Q"'{x) 
with Q"(a3) such that P{f{x 



e) < Q'^ix)) = a. 



Using, for example, the a-quantile as a cost function, it is possible to adapt our 
optimization algorithm to solve (11). Given a set of evaluation results /§ at noise- 
free evaluation points, and assuming that it is possible to sample from the dis- 
tribution pe of £, a Monte-Carlo approximation Q°'{x) of Q°'{x) is easily ob- 
tained by computing f(x + e) over a set sampled from p^. The global optimization 
algorithm can then be applied to Q°'{x) instead of /, using pseudo-evaluations 
= [Q"(a^i), . . . , Q"(aJn)] instead of U 



It is of course possible to combine these ideas and to deal simultaneously with noise 
both on the factors and the function evaluations. 



6 Illustrations 



This section contains some simple examples of global optimization using lAGO, 
with a regular grid as a set of candidate evaluation points. An empirical compar- 
ison with global optimization using expected improvement is also presented. The 



16 



Matem covariance class will be used for Kriging prediction, as it facilitates the tun- 
ing of the variance, regularity and range of correlation of the underlying random 
process, but note that any kind of admissible covariance could have been used. The 
parameters of the covariance may be estimated from the data using a maximum- 
likelihood approach (see Section 8.3). 

6.1 A one-dimensional example 

Consider the function with two global minimizers illustrated by Figure 6 and de- 
fined by / : X I — ^ 4[1 — sin(x + 8 exp(x — 7))]. Given an initial design consisting 
of three points, the lAGO algorithm is used to compute six additional points itera- 
tively. The final Kriging model is depicted in the left part of Figure 6, along with 
the resulting point mass conditional distribution for the minimizer on the right part. 
After adding some noise on the function evaluations, the variant of the algorithm 
presented in Section 5.3.1 is also applied to the function with the same initial de- 
sign. In both cases, the six additional evaluations have significantly reduced the 
uncertainty associated with the position of the global minimizers. The remaining 
likely locations reduce to two small areas centered on the two actual global min- 
imizers. In the noisy case, larger zones are identified, a direct consequence of the 
uncertainty associated with the evaluations. 

Figure 7 illustrates robust optimization using the same function and initial design. 
The cost function used is the 90%-quantile which is computed on the surro- 

gate model but also, and only for the sake of comparison, on the true function using 
Monte Carlo uncertainty propagation (the quantile is approximated using 5000 sim- 
ulations). After six iterations of the robust optimization algorithm, the distribution 
of the robust minimizer is sufficiently peaked to give a good approximation of the 
true global robust minimizer. 

These result are encouraging as they show that the requirement of a fast uncertainty 
reduction is met. The next section provides some more examples, along with a 
comparison with EGO, the El-based global optimization algorithm. 

6.2 Empirical comparison with expected improvement 

Consider first the function described by Figure 8. Given an initial design of three 
points, both the EI and conditional entropy are computed. Their optimization pro- 
vides two candidate evaluation point for /, which are also presented on Figure 8, 
along with the post-evaluation prediction and conditional point mass distribution 
for X^. For this example, the regularity parameter of the Matern covariance is set 
a priori to a high value (2.5), and it is therefore likely that two evaluations close to 
one another, as proposed by the lAGO algorithm, will give some valuable informa- 
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-1 

2 4 6 2 4 6 

(a) Kriging prediction and point mass distribution of the global minimizers 
based on the initial design 




2 4 6 2 4 

(b) Standai^d lAGO algorithm (noise free case) 




2 4 60 2 4 6 



(c) lAGO algorithm for noisy evaluations (the additive noise is zero-mean 
Gaussian with standard deviation 0.2) 

Fig. 6. Example of global optimization using lAGO on a function of one variable (dotted 
line), with an initial design consisting of three points (represented by squares). Six addi- 
tional evaluations are canied out (triangles) using two versions of the lAGO algorithm. 
The graphs on the left part of the figure account for final predictions, while the right part 
presents the final point mass distributions of the global minimizers 

tion about the part of X located on the left of the evaluation point, while improving 
the characterization of the global minimizer. By taking in account the covariance 
of F through conditional simulations, the conditional entropy uses regularity to 
conclude faster. The resulting conditional point mass distribution of the minimiz- 
ers is then generally more peaked using the lAGO algorithm than using the EGO 
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L I 

2 4 60 2 4 6 

Fig. 7. Example of robust optimization using lAGO and the cost function Q^^'^". The func- 
tion / (dotted line), corrupted by a Gaussian noise on the factor (zero mean with a standard 
deviation of 0.2), is studied starting from an initial design of three points (as in Figure 6). 
Six additional evaluations are earned out (triangles), which are used to estimate the cost 
function based on the Kiiging model (bold line), along with the estimated point mass dis- 
tribution of the robust minimizers (right). The cost function Q^'^^" estimated, only for the 
sake of comparison, from the true function using Monte Carlo uncertainty propagation is 
also provided (mixed line). 

algorithm (as illustrated by Figure 8(c) and Figure 8(b)). 

Consider now the Branin function (see, for instance, (Dixon and Szego [1978])), 
defined as 

/ : [-5, 10] X [0, 15] — > R 

+ 10 (l - ^) cos(xi) + 10 . 

It has three global minimizers xl ^ (-3.14,12.27)"^, ^ (3.14,2.27)"^ and 
xl ^ (9.42, 2.47)^, and the global minimum is approximately equal to 0.4. Given 
an initial uniform design of fifteen points, fifteen additional points are iteratively 
selected and evaluated using the lAGO and EGO algorithms. These parameters are 
estimated on the initial design, and kept unchanged during both procedures. The po- 
sitions of these points are presented on Figure 9 (left), along with the three global 
minimizers. Table 2 summarizes the results obtained with EGO and lAGO, based 
on the final Kriging models obtained with both approaches. Note that the EI crite- 
rion in EGO is maximized with a high precision, while the conditional entropy in 
lAGO is computed over a thousand candidate evaluation points located on a regu- 
lar grid. It appears nevertheless that the algorithm using EI stalls on a single global 
minimizer, while the conditional entropy allows a relatively fast estimation of all 
three of them. Besides, the search is more global with lAGO, which yields a bet- 
ter global approximation of the supposedly unknown function. If twenty additional 
evaluations are carried out (as presented in the right part of Figure 9), the final 
Kriging prediction using the SUR approach estimates the minimum with an error 
of less than 0.05 for all three minimizers (cf. Table 2), while the EI criterion does 
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(a) Initial prediction and density of the global minimizers 




(b) Prediction and density of the global minimizers after an additional evalua- 
tion of / chosen with EI 




(c) Prediction and density of the global minimizers after an additional evalua- 
tion of / chosen with conditional entropy 

Fig. 8. Comparison between conditional entropy and EI: the left side contains the Kriging 
predictions before and after an additional evaluation chosen with either EI or conditional 
entropy, while the right side presents the coiTcsponding conditional density of the global 
minimizers. 

not improve the information on any minimizer any further. The difference between 
the two strategies is clearly evidenced. The EI criterion, overestimating the confi- 
dence in the initial prediction, has led to performing evaluations extremely close 
to one another, for a very small information gain. In a context of expensive func- 
tion evaluation, this is highly detrimental. The entropy criterion, using the same 
covariance parameters, does not stack points almost at the same location before 
having identified the most likely zones for the minimizers. The use of what has 
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EGO 


lAGO 




15 iterations 


35 iterations 


15 iterations 


35 iterations 


Euclidean distance between x* and 
its final estimate 


3.22 


3.22 


2.18 


0.23 


Value of the true function at esti- 
mated minimizer 


17.95 


17.95 


2.59 


0.40 


Euclidean distance between ajj and 
its final estimate 


2.40 


2.40 


0.44 


0.18 


Value of the true function at esti- 
mated minimizer 


13.00 


13.00 


0.85 


0.42 


Euclidean distance between a;^ and 
its final estimate 


0.04 


0.04 


0.82 


0.23 


Value of the true function at esti- 
mated minimizer 


0.40 


0.40 


1.94 


0.44 



Table 2 

Estimation results for the Branin function using evaluations of Figure 9 



been assumed and learned about the function is clearly more efficient in this case, 
and this property should be highly attractive when dealing with problems of higher 
dimension. 



(a) 15 iterations using EGO 




(b) 35 iterations using EGO 



A14 

A 12 



10" 



All 

A 10 



as 

X 

A6 



a7 



a2 , 
A3*' 



A A 

A A 
A A 



AA 

A-A 
A^A 
A A 



(c) 15 iterations using lAGO 



(d) 35 iterations using lAGO 



Fig. 9. fifteen iterations of two optimization algorithms, that differ by their criteria for se- 
lecting evaluation points for /, on the Branin function: (top) the EI criterion is used, (bot- 
tom) the conditional entropy criterion is used with a thousand candidate evaluation points 
for / set on a regular grid (squares account for initial data, triangles for new evaluations, 
and crosses give the actual locations of the three global minimizers). 
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7 Discussion 

7. 1 Robustness to uncertainty on the covariance parameters 

Jones [2001] studies the potential of Kriging-based global optimization methods 
such as EGO. One of his most important conclusions, is that these methods "can 
perform poorly if the initial sample is highly deceptive'". An eloquent example is 
provided on page 373, where a sinus function is sampled using its own period, 
leading to a flat prediction over the domain, associated with a small prediction 
error. 

This potential for deception is present throughout the lAGO procedure, and should 
not be ignored. To overcome this difficulty, several methods have been proposed 
(see, e.g., the Enhanced Method 4 in (Jones [2001]) or (Gutmann [2001])), which 
achieve some sort of robustness to an underestimation of the prediction error and 
more generally to a bad choice of covariance. They seem to perform better than 
classical algorithms, including EGO. 

Comparing the lAGO approach to such methods is an interesting topic for future 
research. The issue considered here was to demonstrate the interest of the condi- 
tional entropy criterion, and we feel that this should be done independently from 
the rest of the procedure. 

It is of course essential to make the lAGO algorithm robust to errors in the estima- 
tion of the covariance parameters. In many industrial problems, this can be easily 
done by using prior knowledge on the unknown function to restrict the possible val- 
ues for these parameters. For example, experts of the field often have information 
regarding the range of values attainable by the unknown function. This information 
can be directly used to restrict the search space for the variance of the modeling 
process F, or even to choose it beforehand. 

More generally, given the probabilistic framework used here, it should be relatively 
easy to develop a Bayesian or minimax extension of the lAGO algorithm to guide 
the estimation of the parameters of the covariance. A comparison with robust meth- 
ods such as those detailed in (Jones [2001]) will then be essential. 

7.2 Conclusions and perspectives 

In this paper, a stepwise uncertainty reduction strategy has been used for the se- 
quential global optimization of expensive-to-evaluate functions. This strategy iter- 
atively selects a minimizer of the conditional minimizer entropy as the new evalua- 
tion point Xnew To compute this entropy, a Gaussian random model of the function 
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evaluations is used and the minimizer entropy is estimated through Kriging and 
conditional simulations. At each iteration, the result of the evaluation at tCncw is 
incorporated in the data base used to re-build the Kriging model (with a possible 
re-estimation of the parameters of its co variance). 

We have shown on some simple examples that, compared to the classical El-based 
algorithm EGO, the method proposed significantly reduces the evaluation effort in 
the search for global optimizers. The SUR strategy allows the optimization method 
to adapt the type of search to the information available on the function. In particular, 
the conditional entropy criterion makes full use of the assumed regularity of the 
unknown function to balance global and local searches. 

Choosing an adequate set of candidate points is a crucial point, as it must allow a 
good estimation of a global minimizer of the criterion, while keeping computation 
feasible. Promising results have already been obtained with space-filling designs, 
and adaptive sampling based on the conditional density of the global minimizers 
should provide useful results as dimension increases. 

Extension to constrained optimization is an obviously important topic of future in- 
vestigations. When it is easy to discard the candidate points in X that do not satisfy 
the constraints, the extension is trivial. For expensive-to-evaluate constraints, the 
extension is a major challenge. 

Finally, the SUR strategy associated with conditioning by Kriging is a promising 
solution for the robust optimization of expensive-to-evaluate functions, a problem 
that is central to many industrial situations, for which an efficient product design 
must be found in the presence of significant uncertainty on the values actually taken 
by some factors in mass production. In addition, robustness to the uncertainty as- 
sociated with the estimation of the parameters of the covariance should also be 
sought. 



8 Appendix: modeling with Gaussian processes 

This section recalls the main concepts used in this paper, namely Gaussian process 
modeling and Kriging. The major results will be presented along with the general 
framework for the estimation of the model parameters. 

8.1 Kriging when f is evaluated exactly 

Kriging (Matheron [1963], Chiles and Delfiner [1999]) is a prediction method 
based on random processes that can be used to approximate or interpolate data. 
It can also be understood as a kernel regression method, such as splines (Wahba 
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[1998]) or Support Vector Regression (Smola [1998]). It originates from geostatis- 
tics and is widely used in this domain since the 60s. Kriging is also known as the 
Best Linear Unbiased Prediction (BLUP) in statistics, and has been more recently 
designated as Gaussian Processes (GP) in the 90s in the machine learning commu- 
nity. 

As mentioned in Section 2.1, it is assumed that the function / is a sample path of a 
second-order Gaussian random process F. Denote by m(a;) = E[F{x)] the mean 
of F{x) and by k{x^ y) its covariance, written as 

k{x, y) = coy{F{x), F{y)). 

Kriging then computes the BLUP of F{x), denoted by F{x), in the vector space 
generated by the evaluations EI§ = span{F(a;i), . . . , F(a;„)}. As an element of 
M§, F{x) can be written as 

F{x) = X{xyF^, (12) 
As the BLUP, F{x) must have the smallest variance for the prediction error 

a\x) = ^[{F{x)-F{x)fl (13) 
among all unbiased predictors. The variance of the prediction error satisfies 

a'^{x) = k{x, x) + Xix^KXix) - 2A(a;)'^fc(a;), (14) 

with 

K = ikix,,x,)), (z,j)e[l,nf 
the n X n covariance matrix of F at evaluation points in S, and 

k{x) = [k{xi, a;), . . . , k{xn, x)]^ 

the vector of covariances between F{x) and Fg 

The prediction method (Matheron [1969]) assumes that the mean of F{x) can be 
written as a finite linear combination 

m(x) = /3^p{x), 

where /3 is a vector of fixed but unknown coefficients, and 

p{x) = [pi{x),...,pi{x)]^ 

is a vector of known functions of the factor vector x. Usually these functions are 
monomials of low degree in the components of x (in practice, the degree does not 
exceed two). These functions may be used to reflect some prior knowledge on the 
unknown function. As we have none for the examples considered here, we simply 
use an unknown constant. 
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The Kriging predictor at x is then the best linear predictor subject to the unbi- 
asedness constraint K{F(x)) = m(x), whatever the unknown (3. The unbiasedness 
constraint translates into 



f3''P^X{x)=f3''p{x), 



(15) 



with 



For (15) to be satisfied for all /3, the Kriging coefficients must satisfy the linear 
constraints 

P''X{x)=p{x), (16) 
called universality constraints by Matheron. At this point, Kriging can be reformu- 
lated as follows: find the vector of Kriging coefficients that minimizes the variance 
of the prediction error (14) subject to the constraints (16). This problem can be 
solved via a Lagrangian formulation, with n{x) a vector of / Lagrange multipliers. 
The coefficients X{x) are then solutions of the linear system of equations 



(17) 



with a matrix of zeros. A convenient expression for the variance of the prediction 
error is obtained by substituting k(x) — Pfi(x) for KX(x) in (14) as justified by 
(17), to get 





E 



F{x) — F{x) = k{x, x) — X{x y k{x) - p{xy ti{x) . (18) 



The variance of the prediction error at x can thus be computed without any evalu- 
ation of /, using (17) and (18). It provides a measure of the quality associated with 
the Kriging prediction. Evaluations of / remain needed to estimate the parameters 
of the covariance of F (if any), as will be seen in Section 8.3.2. 

Once / has been evaluated at all evaluation points, the prediction of the value taken 
by / at a; becomes 

fix) = Xix^fs , (19) 
with /§ = [f{xi), . . . , f{xn)]^ (fs is viewed as a sample value of F§). 

It is easy to check that (17) implies that 

Va;, G §, F{Xi) = F{Xi). 

The prediction of f at Xi E E> is then f{xi), so Kriging is an interpolation with 
the considerable advantage that it also accounts for model uncertainty through an 
explicit characterization of the prediction error. 
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Remark: The Bayesian framework (see, for instance, Williams and Rasmussen 
[1996]) is an alternative approach to derive the BLUP, in which the model F is 
viewed as a Bayesian prior on the output. In the case of a zero-mean model, the 
conditional distribution of the function is then Gaussian with mean 

E [F{x)\ Fs = h] = k{xyK''fs, (20) 

and variance 

Var [F{x)\ Fs = fs] = k{x, x) - k{xyK-'k{x), 

which are exactly the mean (19) and variance (18) of the Kriging predictor for a 
model F with zero mean. The Kriging predictor can also be viewed as the condi- 
tional mean of F(x) in the case of an unknown mean, if the universality constraints 
are viewed as a non-informative prior on /3. 



8.2 Kriging when f is evaluated approximately 



The Kriging predictor was previously defined as the element of the space Elg gen- 
erated by the random variables F(xi) that minimizes the prediction error. A natural 
step is to extend this formulation to the case of a function whose evaluations are cor- 
rupted by additive independent and identically distributed Gaussian noise variables 
Ej with zero mean and variance cr?. The model of the observations then becomes 



nobs 



Fix, 



+ eii 



Fix) = MxVF^^' with F^^^^ 



n, and the Kriging predictor for F{x) takes the form 

T 



pobs 



7obs 



. The unbiasedness constraint 



(16) remain unchanged, while the mean-square error (2) becomes 

^[F{x) - F{x)f = kix, x) + Xix^iK + a^Jn)Mx) - 2\{xyk{x), 

with In the identity matrix. Finally, using Lagrange multipliers as before, it is easy 
to show that the coefficients \{x) of the prediction must satisfy 





(21) 



The resulting prediction is no longer interpolative, but can still be viewed as the 
mean of the conditional distribution of F. The variance of the prediction error is 
again obtained using (18). 



8.3 Covariance choice 



Choosing a suitable covariance function k{-,-) for a given / is a recurrent and 
fundamental question. It involves the choice of a parametrized class (or model) of 
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covariance, and the estimation of its parameters. 

8.3.1 Covariance classes 

The asymptotic theory of Kriging (Stein [1999]) stresses the importance of the 
behaviour of the covariance near the origin. This behaviour is indeed linked with 
the quadratic-mean regularity of the random process. For instance, if the covari- 
ance is continuous at the origin, then the process will be continuous in quadratic 
mean. In practice, one often uses covariances that are invariant by translation (or 
equivalently stationary), isotropic, and such that regularity can be adjusted. Non- 
stationary covariances are seldom used in practice, as they make parameter esti- 
mation particularly difficult (Chiles and Delfiner [1999]). Isotropy, however, is not 
required and can even be inappropriate when the factors are of different natures. 
An example of an anisotropic, stationary covariance class is k(x, y) = k{h), with 

h = \l — yY A{x — y) where {x, y) G and A is a symmetric positive defi- 
nite matrix. 

A number of covariance classes are classically used (for instance exponential h ^ 
cr^ exp(— 6'|/i|°), product of exponentials, or polynomial). The Matern covariance 
class offers the possibility to adjust regularity with a single parameter (Stein [1999]). 
The Fourier transform of a Matern covariance is 

where u controls the decay oi k{uj) at infinity and therefore the regularity of the co- 
variance at the origin. Stein [1999]) advocates the use of the following parametriza- 
tion of the Matern class: 

where Ky is the modified Bessel function of the second kind (Yaglom [1986]). This 
parameterization is easy to interpret, as u controls regularity, is the variance 
(A;(0) = cr^), and p represents the range of the covariance, i.e., the characteris- 
tic correlation distance. To stress the significance and relevance of the regularity 
parameter. Figure 10 shows the influence of u on the covariance, and Figure 11 
demonstrates its impact on the sample paths. Since Kriging assumes that / is a 
sample path of F, a careful choice of the parameters of the covariance is essential. 

8.3.2 Covariance parameters 

The parameters for a given covariance class can either be fixed using prior knowl- 
edge on the system, or be estimated from experimental data. Li geostatistics, es- 
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Fig. 10. Matern covariances for the parameterization (22) with p = 0.5, o"^ = 1. Solid line 
coiTesponds to = 4, dashed line to v = 1 and dotted line to = 0.25. 




Fig. 11. Three sample paths of a zero-mean Gaussian process with a Matem covariance. 
Conventions are as in Figure 10: = 4 for the solid line, z> = 1 for the dashed line and 
V = 0.25 for the dotted line. 
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timation is carried out using the adequacy between the empirical and model co- 
variances (Chiles and Delfiner [1999]). In other areas, cross validation (Wahba 
[1998]) and maximum likelihood (Stein [1999]) are mostly employed. For sim- 
plicity and generality reasons (Stein [1999]), the maximum-likelihood method is 
preferred here. 

The maximum-likelihood estimate of the parameters of the covariance maximizes 
the probability density of the data. Using the joint probability density of the ob- 
served Gaussian vector, and assuming that the mean of F{x) is zero for the sake of 
simplicity, the maximum-likelihood estimate of the vector 6 of the covariance pa- 
rameters is obtained (see, for instance, Vechia [1998]) by minimizing the negative 
log-likelihood 

n 1 1 

1(6) = -log27r + -logdetK(0) + -fjK{ey'f^, (24) 

In the case of an unknown mean for F{x), it is possible to estimate the parameters, 
using for example the REstricted Maximum Likelihood (REML, see Stein [1999]). 
This is the approach used for the examples in this paper. 

Figure 12 illustrates prediction by Kriging with a Matem covariance, the parame- 
ters of which have been estimated by REML. The prediction interpolates the data, 
and confidence intervals are deduced from the square root of the variance of the 
prediction error to assess the quality of the prediction between data. Figure 12 also 
contains a series of conditional simulations (obtained with the method explained 
in Section 3.2), namely sample paths of F that interpolate the data. As implied by 
(20), the Kriging prediction is the mean of these conditional simulations. 
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Fig. 12. Example of Kriging interpolation (bold line) for a function of one variable. The 
data are represented by squares, and the covariance parameters were estimated by REML. 
Dashed lines delimit 95% confidence region for the prediction. The thin solid lines are 
examples of conditional simulations. 
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